AUCReshaping: improved sensitivity at high-specificity

The evaluation of deep-learning (DL) systems typically relies on the Area under the Receiver-Operating-Curve (AU-ROC) as a performance metric. However, AU-ROC, in its holistic form, does not sufficiently consider performance within specific ranges of sensitivity and specificity, which are critical for the intended operational context of the system. Consequently, two systems with identical AU-ROC values can exhibit significantly divergent real-world performance. This issue is particularly pronounced in the context of anomaly detection tasks, a commonly employed application of DL systems across various research domains, including medical imaging, industrial automation, manufacturing, cyber security, fraud detection, and drug research, among others. The challenge arises from the heavy class imbalance in training datasets, with the abnormality class often incurring a considerably higher misclassification cost compared to the normal class. Traditional DL systems address this by adjusting the weighting of the cost function or optimizing for specific points along the ROC curve. While these approaches yield reasonable results in many cases, they do not actively seek to maximize performance for the desired operating point. In this study, we introduce a novel technique known as AUCReshaping, designed to reshape the ROC curve exclusively within the specified sensitivity and specificity range, by optimizing sensitivity at a predetermined specificity level. This reshaping is achieved through an adaptive and iterative boosting mechanism that allows the network to focus on pertinent samples during the learning process. We primarily investigated the impact of AUCReshaping in the context of abnormality detection tasks, specifically in Chest X-Ray (CXR) analysis, followed by breast mammogram and credit card fraud detection tasks. The results reveal a substantial improvement, ranging from 2 to 40%, in sensitivity at high-specificity levels for binary classification tasks.

to a single point on the ROC curve. A high AUC score does not necessarily indicate the desired performance, or minimal misclassification of one class, at the chosen operating point. It is therefore customary to examine the shape of the curve, rather than rely on the AUC score alone, when selecting the best model for deployment.
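The point that equal AUC scores can hide different behavior at the operating point can be checked numerically. The sketch below uses two hypothetical piecewise-linear ROC curves (`curve_a` and `curve_b` are illustrative values, not results from this study) that share the same AUC but differ sharply in sensitivity at 95% specificity (FPR = 0.05).

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given (fpr, tpr) points sorted by fpr."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, fpr):
    """Linearly interpolate the sensitivity (TPR) at a given FPR."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 > x0 and x0 <= fpr <= x1:  # skip vertical segments
            return y0 + (fpr - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("fpr lies outside the curve")


# Hypothetical curves: both have AUC = 0.80.
curve_a = [(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.0, 0.6), (1.0, 1.0)]

for name, curve in (("A", curve_a), ("B", curve_b)):
    print(name, round(trapezoid_auc(curve), 3), round(tpr_at_fpr(curve, 0.05), 3))
```

Under these assumed curves, model B would be preferred for a high-specificity deployment even though the AUC scores do not distinguish the two.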
Figure 1 provides an overview of a typical abnormality detection scenario, as influenced by sensitivity and specificity. The figure illustrates the performance of multiple unassisted readers, along with that of a potential Computer-Aided Diagnostic (CAD) system at various classification thresholds (shaping the ROC curve). Given that one of the primary objectives of a CAD system is to reduce variation among readers in an assisted setting 14,15, any enhancement in CAD system performance beyond the effective interval is unlikely to impact the reader's performance in such a setting. The region of interest (ROI) highlights where performance improvement is of particular importance within the effective interval.
Frequently, practical systems are mandated to achieve an almost negligible false positive rate, primarily due to the substantial misclassification costs associated with the smaller abnormal or negative class. In medical applications, misclassifying an abnormal sample is akin to erroneously labeling a pathological image as normal, potentially leading to diagnostic delays and, in severe cases, endangering patient lives. For instance, we concentrate on a Chest X-Ray abnormality classifier designed to aid radiologists by effectively filtering out a maximal number of normal Chest X-Rays (CXRs), necessitating high sensitivity at high specificity. As depicted in Fig. 1, the classification thresholds within the Region of Interest (ROI) determine the number of misclassified positive samples while keeping the misclassifications of negative samples at a consistently low level (e.g., within 20%). This effective interval varies based on the specific application and system design. In the context of our CXR systems, we emphasize the 2-5% False Positive Rate (FPR) range as our region of interest.
While rank-based metrics have been proposed as a generalized solution to enhance sensitivity at high specificity 16, there remains ample room for improving system performance. Past research has explored strategies like using an ensemble of classifiers and amplifying the cost of misclassifications for critical classes, aiming to maximize the overall AUC score and enhance the entire ROC curve. Similar research endeavors targeting overall AUC improvement include approaches that address label noise by leveraging prior anatomical information 17, such as heart-lung segmentation, and other techniques that harness the high comorbidity of diseases in CXRs. Another approach incorporates uncertainty scores 18 alongside probabilistic estimates to enhance prediction robustness and accuracy. Recent research has also introduced matrix-instance-based one-pass AUC optimization 19, mitigating the need to store the entire dataset or parts of it in memory while reducing run-time costs. The kernelized online imbalanced learning (KOIL) algorithm, a component of AUC maximization research 20, proposes a non-linear classifier designed to maximize the AUC score for large imbalanced datasets. A novel approach to maximizing the AUC score, based on sampling mini-batches of positive/negative instance pairs 21 and computing U-statistics to approximate a global risk minimization problem, has also been shown to be simple, fast and learning-rate free. However, despite these advancements, these systems are not yet fully mature for deployment in the diagnostic industry 22 and do not guarantee performance within the desired ROI.
Figure 1. The effective interval delineates the region of practical significance within the ROC curve, specifically the area characterized by a False Positive Rate of less than 20%. Enhancements beyond this region have negligible bearing on the practical performance of a commercial classification system. The region of interest denotes the specific points on the curve where AUCReshaping is applied.
www.nature.com/scientificreports/
In this paper, we:
• Propose a novel evaluation metric, known as sensitivity at high specificity, designed to assess the real-world performance of deep learning systems when dealing with data-imbalanced datasets.
• Introduce a mechanism to enhance the ROC-based performance metric through AUCReshaping, which is tailored for binary classification tasks.
• Evaluate the robustness of this method using both medical and non-medical datasets characterized by a significant class imbalance and skewed misclassification costs.
To elaborate, the AUCReshaping function amplifies the weights assigned to misclassified samples within the Region of Interest (ROI) on the ROC curve. In our application, this ROI corresponds to the high-specificity region. This enhancement is achieved through supervised fine-tuning, ensuring that the system maximizes the detection of positive samples while minimizing the misclassification of negative class samples (False Positive Rate: FPR). We first explore the impact of the AUCReshaping function on a dataset of Chest X-Rays (CXRs). With the acquisition of millions of CXR images, the ability to confidently filter out normal images relieves radiologists of the arduous task of parsing hundreds of images on a daily basis 14,15,23. This, in turn, empowers them to direct their attention to critical patients, significantly expediting the diagnostic process, especially in situations akin to pandemics. These Self-Supervised Learning (SSL) based Computer-Aided Diagnosis (CAD) systems that support overburdened radiologists not only enhance patient care but also contribute to reducing healthcare costs 24, a particularly pressing concern today.
Despite the recent availability of extensive CXR datasets such as NIH CXR-14 25, CheXPert 26, or MIMIC-II 27, the accurate annotation of CXRs remains a significant challenge. In comparison to the large Deep Learning (DL) models trained on billions of natural images sourced from social media with crowd-labeled annotations 28, CXR annotations are considerably more time-consuming and expensive, and they require the expertise of skilled radiologists. As such, this paper investigates the application of AUCReshaping in conjunction with an SSL-based abnormality classifier, with the aim of developing a scalable and pragmatic solution for CXR CAD. Figure 2 provides an overview of the experiment's general workflow, which involves the application of AUCReshaping in the fine-tuning stage on the training set. In this process, a pre-trained model undergoes fine-tuning using a subset of images from the pre-training dataset. During this phase, the AUCReshaping function identifies positive class samples that are misclassified at the high-specificity threshold, and amplifies their weights. The loss value is subsequently computed and backpropagated through the network. The high-specificity threshold determined during validation is carried over to the testing phase, where it serves as the actual classification threshold. The

Methods
The paper describes scientific research using retrospectively acquired anonymized data. The anonymization was performed in accordance with applicable laws and regulations before secure transfer was made to Siemens Healthineers for the study. Use of the data followed all the applicable license terms. The study does not involve any clinical or human subject research component. As such, IRB approval and informed consent are not applicable because there are no data privacy issues and no patients were impacted by the research (it did not affect treatment or diagnosis). The research was conducted conforming to the appropriate scientific practices and in accordance with the relevant guidelines and regulations of the institution conducting the experiments.
In this study, we introduce a novel function known as AUCReshaping, designed to re-weight the output predictions exclusively for misclassified training samples within the Region of Interest (ROI) of the ROC curve. As previously mentioned, in our specific application, this ROI corresponds to the high-specificity region (90-98%). High-specificity thresholds on the ROC curve are defined as the operating points that result in low False Positive Rates (FPR).
The AUCReshaping method is implemented during the fine-tuning stage, where it adapts the weights for the downstream task. This re-weighting process can be likened to boosting, where the weights of selected samples from the training set are iteratively increased. When dealing with two classes, we only boost the weights of samples from the class with the lower misclassification cost that are misclassified at the high-specificity threshold. The magnitude of boosting is determined based on a piece-wise approximation of the average distance of misclassified positive class samples from the decision boundary. The net effect is an increase in the uncertainty of the output predictions for misclassified positive samples, specifically the false negative samples. This adjustment is made while keeping the number of false positives at a low and fixed value, as demonstrated in Table 1. It is worth noting that this algorithm is not confined to a single threshold or boosting value. Instead, a modulating adaptive boosting value can be applied at each threshold, allowing for the specification of varying costs for these samples.

AUC reshaping
The cost function corresponding to the proposed modification to the loss function is currently implemented only as a constant boosting value. This is mathematically described as

$$b_i = \begin{cases} n, & \text{if } y_i = 1 \text{ and } p_i < \theta_{max} \\ 0, & \text{otherwise,} \end{cases}$$

where $n$ is the scalar boosting value and $\theta_{max}$ is the high-specificity threshold, such as the classification threshold at 0.95 or 0.98 specificity. Here, $y_i$ is the target label, $p_i$ is the output predicted by the network and $b_i$ is the boosting value applied conditionally as shown above. In this paper, we use $b_i$ as a constant value which is applied when the positive class is misclassified at the specified boosting thresholds on the ROC curve. The final loss function is an extension of the regular cross-entropy (CE) loss, incorporating an additional bias value for the selected samples. As mentioned earlier, the boosting can be applied at various thresholds on the ROC curve, which results in the adjustment of weights assigned to the misclassified positive samples based on their proximity to the high-specificity decision boundaries. Moreover, these boosting values do not have to be uniform across all thresholds, allowing for diverse weightings of samples. This can be implemented as a discrete or continuous function.
The AUCReshaping function is visualized in Fig. 3. In Fig. 3a, the diagram represents the typical optimal decision boundary that maximizes both sensitivity and specificity. Here, a few misclassifications of both the red and blue samples are observed, situated on the opposite side of the decision boundary. This choice of the classification boundary is conventional in a representative DL system and is commonly used to measure system performance by default. Figure 3b showcases the decision boundary at a specific high-specificity threshold. In this case, it has been adjusted to minimize misclassifications of the red samples. Although there is a higher number of misclassified blue samples, their associated cost is significantly lower than in the opposite scenario. Hence, the decision boundary is shifted accordingly, resulting in a smaller percentage of misclassified red samples. Figure 3c highlights the misclassified positive samples that receive a boost from the AUCReshaping function, indicated by the larger blue circles. The network now focuses on these samples and learns relevant features to improve their correct classification at this specific decision boundary. By incorporating different boosting values at various thresholds, the size of the boosted blue sample circles can vary, depending on the chosen thresholds. This flexibility enables the network to concentrate on particularly challenging samples. The choice of these design parameters depends on the specific application and data distribution. Although the current implementation does not parameterize the boosting value, it could also be approximated as a weighted moving average of the mean distance of misclassified samples to the decision boundary at high specificities.
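The conditional boosting value described above can be sketched in a few lines of Python. The function `boost_values` follows the stated rule (a constant n for positive samples whose prediction falls below the high-specificity threshold, zero otherwise). How the boost enters the cross-entropy loss is not fully specified in the text, so `boosted_cross_entropy` is only one plausible reading, an assumption made for illustration, in which each sample's CE term is scaled by (1 + b_i).

```python
import math


def boost_values(labels, probs, theta_max, n):
    """b_i = n if y_i == 1 and p_i < theta_max, else 0 (rule stated in the text)."""
    return [n if (y == 1 and p < theta_max) else 0.0
            for y, p in zip(labels, probs)]


def boosted_cross_entropy(labels, probs, boosts, eps=1e-7):
    """ASSUMPTION: the boost scales the per-sample CE term by (1 + b_i).

    This is an illustrative choice, not the authors' confirmed loss.
    """
    total = 0.0
    for y, p, b in zip(labels, probs, boosts):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += (1.0 + b) * ce
    return total / len(labels)


# Hypothetical mini-batch: only the first sample is a positive below the threshold.
labels = [1, 1, 0, 0]
probs = [0.3, 0.9, 0.2, 0.95]
boosts = boost_values(labels, probs, theta_max=0.8, n=0.4)
print(boosts, boosted_cross_entropy(labels, probs, boosts))
```

Note that the false positive (label 0, prediction 0.95) receives no boost, which mirrors the description that only misclassified positive samples are re-weighted.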
In this study, we evaluate the performance of AUCReshaping using pre-trained models based on the Momentum Contrast Encoder (MoCo) 29 and Swapping Assignments Between Views (SwAV) 30, both of which are self-supervised learning methods rooted in contrastive learning. MoCo, an extension of the initial contrastive learning methodology, leverages a momentum encoder and additional data augmentations to achieve superior performance in various detection and segmentation tasks when compared to its Imagenet-supervised counterpart 29. SwAV, on the other hand, is a self-supervised learning approach inspired by contrastive learning principles but differs in not using negative samples or pairwise comparisons between samples 30. These publicly available pre-trained models, based on Imagenet 31 (a dataset containing over 14 million natural images), serve as a basis for our evaluation. Additionally, we introduce a model pre-trained on 1.3 million Chest X-Rays (CXRs) 32, which we also use for comparative purposes.
The classification task at hand is a binary classification problem, involving two categories of images. Specifically, it entails distinguishing between images related to a specific disease and other images. This task can be viewed as a detector for normal X-Rays, where X-Ray images associated with the disease are considered the negative class, and normal X-Ray images are the positive class that needs to be detected. To assess the impact of AUCReshaping, we evaluate the AUC score and the sensitivity at both 95% and 98% specificity levels, on both the validation and test datasets across all our experiments.
In a practical application, following the training phase, we measure sensitivities at high-specificity levels on the validation dataset. The thresholds determined from the validation set are then used to assess sensitivity and specificity on the test set, as indicated in Fig. 2. This process yields a binary classification threshold that achieves both high sensitivity and specificity on the test dataset. However, it is important to note that the test set specificity levels may not perfectly align with the validation set levels, which can make it challenging to determine whether any improvements are solely due to the selection of a different point on the ROC curve. To address this, we independently compare the test set sensitivities at 95% and 98% specificity, to evaluate whether the ROC curve has indeed improved. It is worth emphasizing that achieving the right specificity level on the test data requires appropriate calibration of the validation thresholds.
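The validation-to-test threshold transfer described above can be illustrated with a small, dependency-free sketch. The function names and the toy scores are hypothetical; scores are taken to be the predicted probability of the positive class, and a sample is called positive when its score is at or above the threshold. The sketch requires Python 3.9 or later for `math.nextafter`.

```python
import math


def threshold_at_specificity(neg_scores, target_specificity):
    """Smallest threshold at which at least the target fraction of negatives score below it."""
    ordered = sorted(neg_scores)
    k = math.ceil(target_specificity * len(ordered) - 1e-9)  # negatives that must be rejected
    if not 1 <= k <= len(ordered):
        raise ValueError("target_specificity must lie in (0, 1]")
    # Place the threshold just above the k-th smallest negative score.
    return math.nextafter(ordered[k - 1], math.inf)


def sensitivity(pos_scores, theta):
    return sum(p >= theta for p in pos_scores) / len(pos_scores)


def specificity(neg_scores, theta):
    return sum(p < theta for p in neg_scores) / len(neg_scores)


# Hypothetical validation and test scores.
val_neg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
val_pos = [0.85, 0.92, 0.97, 0.99, 0.5]
test_neg = [0.2, 0.4, 0.6, 0.93, 0.1]

theta = threshold_at_specificity(val_neg, 0.90)
print(sensitivity(val_pos, theta), specificity(val_neg, theta), specificity(test_neg, theta))
```

In this toy case the threshold that gives 90% specificity on validation yields only 80% on the test scores, which is the misalignment the paragraph warns about and the reason test sensitivities are also compared at fixed specificity levels.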
The desired ROC curve is depicted in Fig. 4. We notice that the red curve may exhibit a slightly lower AUC score compared to the original orange curve. However, the performance in the high-specificity region is enhanced in the updated curve as indicated. Our proposed methodology enhances the operational characteristics of the deployed system by enabling improvements within a targeted segment of the ROC curve. Note that the ROC curve may be minimally negatively impacted in the region outside the effective interval.

Data description
The CXR dataset used in the fine-tuning stage comprises 16,953 anonymized training images, and a few example images are shown in Fig. 5. These are typical anonymized frontal view CXR images of patients, with bounding boxes indicating the presence of pleural effusion. A subset of this dataset is used as a validation dataset, where the images are not part of the training set. The CXR dataset includes 3,663 pleural effusion images, 1,772 pneumothorax (ptx) images, and 14 other abnormalities. The images are stored in 10-bit DICOM format and vary in resolution, typically exceeding 2048 × 3096 pixels. An expert panel of radiologists provided bounding box coordinates, each associated with a disease label, and a single image could have multiple annotated boxed regions belonging to different categories.
To create image-level labels indicating the absence or presence of abnormalities, a binary coding scheme was employed (1 = absent, 0 = present). The training dataset remains consistent across all experiments, but the ground truth labels differ according to the specific classification task. For instance, in the first case a pleural effusion list is formed with images containing pleural effusion marked as 0 and images without marked as 1. A similar procedure is followed to create the ptx list, and in the third case a combined training list is formed, where an image with either ptx or pleural effusion is labeled as class 0, while an image without either of them is assigned class 1. This results in 3223, 1607, and 4069 disease-positive (label 0) training samples for the pleural effusion, ptx and combined classification tasks, respectively. The validation dataset exhibits a similar pattern, containing 440, 165, and 466 disease-positive samples, as indicated in Table 2. There could be multiple images of a patient in the training dataset.
A similar procedure is applied to generate the test datasets. The pleural effusion test dataset comprises 331 images, including 74 pleural effusion images and 257 images featuring other abnormalities. In the case of pneumothorax (ptx), a test dataset with 412 images is established, containing 154 ptx images and 258 non-ptx images. Additionally, a combined test dataset is created, incorporating 228 images that exhibit either "pleural effusion or ptx" and 294 images without either of these diseases, resulting in a total of 522 test images. The validation dataset is a subset of the training dataset, where the class-wise ratio is maintained. Particular care is taken to ensure that no images from the test set patients are included in either the training or validation set. The complete data split is depicted in Table 2.
The AUCReshaping methodology is further evaluated using the VinDr-Mammo dataset, an extensive benchmark dataset designed for full-field digital mammography 33,34. This dataset encompasses a total of 5000 four-view examinations, including craniocaudal (CC) and mediolateral oblique (MLO) views, thus comprising a total of 20,000 images. Notably, the dataset features consensus annotations provided by three experienced radiologists, particularly for assessments based on the Breast Imaging Reporting and Data System (BI-RADS) and breast density categorization 35.
Within the VinDr-Mammo dataset, a training set of 16,000 images is available. This training set is divided into two subsets: a training subset comprising 13,600 images and a validation subset containing 2,400 images. The test set consists of the original 4,000 images from the dataset. These images are acquired from various mammography systems, including those from manufacturers such as Siemens, Planmed, and IMS. The selection of images for both the training and validation subsets is made at the patient level, with meticulous attention to ensuring an even distribution of samples based on their class categories (BI-RADS and breast density). This careful curation aims to mitigate potential sources of bias in the experiments.
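A patient-level split of the kind described above can be sketched as follows. The record layout (`image_id`, `patient_id`), the function name and the seed are hypothetical, and this minimal version does not reproduce the class-wise balancing mentioned in the text; it only guarantees that all images of a patient land in the same subset.

```python
import random
from collections import defaultdict


def patient_level_split(records, val_fraction, seed=0):
    """Split (image_id, patient_id) records so that no patient spans both subsets."""
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)            # deterministic order before shuffling
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(val_fraction * len(patients)))

    val = [img for p in patients[:n_val] for img in by_patient[p]]
    train = [img for p in patients[n_val:] for img in by_patient[p]]
    return train, val


# Hypothetical data: 10 patients with 4 views each.
records = [(f"p{p}_v{v}", f"p{p}") for p in range(10) for v in range(4)]
train_ids, val_ids = patient_level_split(records, val_fraction=0.15)
print(len(train_ids), len(val_ids))
```

Stratifying by class on top of this would require grouping patients by their label distribution first, which is left out here for brevity.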
Our study leverages this dataset to explore two distinct classification tasks: breast density categorization and BI-RADS classification. In the context of breast density categorization, the four breast density categories, ranging from A (almost entirely fatty) to D (extremely dense) 35, are grouped into two classes. The positive class encompasses non-dense breasts (categories A and B), while the negative class comprises dense breasts (categories C and D) 36,37. In the training, validation, and test splits, we have 1380, 228, and 400 samples for the positive class, and 12,220, 2172, and 3600 samples for the negative class, respectively. Additionally, the BI-RADS assessment scores are categorized into two classes based on the BI-RADS management recommendations for tissue diagnosis 35. The positive class is composed of images with BI-RADS scores of 4 and 5, while the negative class includes images with BI-RADS scores of 1, 2, and 3. Across the training, validation, and test sets, we have 656, 134, and 198 images in the positive class, and 12,944, 2266, and 3802 images in the negative class, respectively. Notably, in both tasks the positive class is the minority abnormality class, entailing a high misclassification cost. Furthermore, data imbalance is more pronounced in the BI-RADS task than in the breast density task.
For illustrative purposes, Fig. 6 showcases breast mammogram images from a 42-year-old patient within the VinDr-Mammo dataset 33. These images display the R-CC, R-MLO, L-CC, and L-MLO views, presented from left to right. Both breasts exhibit a breast density category of D, with the right breast (R) assigned a BI-RADS score of 2, and the left breast (L) designated with a BI-RADS score of 5. These ratings are attributed to the presence of suspicious calcifications and a mass, as indicated within the bounding box.
To further validate the robustness of this method in non-medical domains, we applied AUCReshaping to the domain of credit card fraud detection. The dataset used comprises transactions made by European cardholders and spans a two-day period [38][39][40][41][42][43][44][45]. Of the 284,807 transactions in this dataset, 492 are fraudulent, accounting for only 0.172% of all transactions; the remaining 284,315 are normal transactions. The dataset was divided into training and test sets, with 227,454 and 56,861 positive samples in each, and 391 and 101 negative samples, respectively. The total number of samples in each dataset, along with the respective number of positive samples (enclosed in brackets), is given in Table 2.

Implementation details
In the CXR experiments, images are first resized to 256 × 256 pixels before being input into the fine-tuning stage. These experiments are executed on 8 GPUs with a batch size of 64, employing PyTorch and Distributed Data-Parallel computing (DDP). The models are optimized using the Adam 46 optimizer, initialized with a learning rate of 0.01, weight decay set at 1e-5, and a Cross-Entropy (CE) loss function. Class weights are determined based on the class ratio within the training dataset 30. Distributed weighted sampling is adopted to over-sample the negative classes to address class imbalance and ensure smooth data loading across the GPUs.
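The text states that class weights follow the class ratio and that weighted sampling over-samples the minority class, but it does not give the exact formula. The sketch below assumes a common inverse-frequency scheme; `inverse_frequency_weights` and `per_sample_weights` are hypothetical helper names, and the label list is a toy example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """ASSUMPTION: weight_c = N / (num_classes * count_c), a common 'balanced' choice."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def per_sample_weights(labels):
    """One sampling weight per sample, so rarer classes are drawn more often."""
    class_weights = inverse_frequency_weights(labels)
    return [class_weights[y] for y in labels]


# Toy imbalanced label list: 8 positives (1) and 2 negatives (0).
labels = [1] * 8 + [0] * 2
print(inverse_frequency_weights(labels))
print(per_sample_weights(labels)[:3], per_sample_weights(labels)[-2:])
```

With these weights each class contributes the same total sampling mass. In a PyTorch pipeline, a list like this could be handed to `torch.utils.data.WeightedRandomSampler`, although the distributed variant used in the experiments is not detailed in the text.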
The number of repeated measurements follows a negative binomial distribution. The number of repeated measurements needed for a high-enough statistical power (>0.98) is calculated using the technique described in 47, known as "marginal benefit". Based on the statistical power, the top 5 models were averaged for all experiments. The baseline experiments were repeated 8-10 times, and the rest were repeated only five times to calculate the average. In the end-to-end (e2e) fine-tuning process, the training loss gradually diminishes until overfitting occurs. Consequently, the epoch with the lowest validation loss, often occurring within 10 epochs, is selected. As mentioned earlier, a range of pre-trained models is assessed.
Augmentations are typically applied only in the training stage and not in the validation or testing stage. Working with DICOM images necessitates the application of a distinct set of augmentations. Given the higher precision of the input data, the built-in PyTorch transforms are not directly applicable and must be re-implemented to suit the data format. Augmentations like random intensity scaling and horizontal flipping are utilized, in addition to a custom normalization process for the DICOM images. This normalization includes histogram equalization and dynamic window scaling 32.
For the breast density experiments, the images are resized to 256 × 256 pixels before they are input into the fine-tuning network. In the case of DICOM images, pre-processing is carried out initially to remove the CC and MLO view labels. These images contain labels indicating the right or left breast, along with the corresponding CC or MLO view superimposed on the image. These labels are first cropped, and fundamental array operations are employed to remove the black background. Subsequently, the pixel values are re-scaled using the window level and width from the DICOM data, resulting in images saved in PNG format with standard aspect ratios and sizes. Unlike the CXR experiments, the mammogram experiments employ different augmentations, specifically random equalization and random erasing, with low probabilities. Other network hyper-parameters align with those used in the CXR experiments, and the results represent the average of 5 runs. Evaluation of the classification results is performed using a SwAV-based Imagenet pre-trained model.
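The window level and width re-scaling step can be illustrated with a simplified linear windowing function. This is a sketch under stated assumptions: `apply_window` is a hypothetical helper, the mapping is the plain linear form (clip to the window, then scale to 8 bits for PNG export) rather than the exact formula of the DICOM standard, and the pixel values are made up.

```python
def apply_window(pixels, center, width, out_max=255):
    """Clip raw pixel values to [center - width/2, center + width/2] and scale to 0..out_max."""
    if width <= 0:
        raise ValueError("width must be positive")
    lo = center - width / 2.0
    hi = center + width / 2.0
    scaled = []
    for value in pixels:
        clipped = min(max(value, lo), hi)
        scaled.append(round((clipped - lo) / (hi - lo) * out_max))
    return scaled


# Hypothetical raw values with a window centered at 512 and 1024 wide.
print(apply_window([0, 256, 1024, 2000], center=512, width=1024))
```

Values above the window saturate at the maximum, which is why the choice of window affects how much dense-tissue contrast survives the conversion.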
For the credit card fraud classification, a straightforward Multi-layer Perceptron (MLP) network is employed, comprising a combination of linear, ReLU, and dropout layers with a dropout probability of 0.5. Experiments are executed on a single GPU, with a batch size of 100, and a simple SGD optimizer is utilized. To address the data imbalance, a positive-class weight of 5 is applied through the BCEWithLogitsLoss function. The model undergoes training for 200 epochs until convergence of the training loss, and the results represent the average of 5 independent runs.
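To show what a positive-class weight of 5 does to the loss, the per-sample binary cross-entropy with logits can be written out in plain Python. The function below mirrors the documented formula of PyTorch's `BCEWithLogitsLoss` with `pos_weight` for a single sample; it is an illustration rather than the training code, and which class carries label 1 in the fraud experiment is not stated in the text.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_with_logits(logit, target, pos_weight=1.0):
    """-[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    # log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x)
    return pos_weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)


# At logit 0 the model is maximally unsure; the weighted class costs five times more.
print(bce_with_logits(0.0, 1.0, pos_weight=5.0))
print(bce_with_logits(0.0, 0.0, pos_weight=5.0))
```

The weight only multiplies the term for samples with target 1, so errors on that class dominate the gradient while the other class is left unchanged.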

Results
The models used to test AUCReshaping on CXRs are called "SSL Imagenet SwAV pre-trained", "SSL Imagenet MoCo pre-trained" and "SSL CXR_1million SwAV pre-trained", respectively, in Tables 3, 4 and 5. All of the experiments are conducted with ResNet-50 architecture base models. The CXR experiments target two different abnormalities, pleural effusion and pneumothorax. The classification tasks include pleural effusion versus non-pleural effusion, ptx versus non-ptx, and a combined dataset for normal versus abnormal classification, where "normal" denotes images without either pleural effusion or ptx. Pleural effusion is relatively easy for the classifier to detect, resulting in higher baseline AUC scores, while ptx poses a more challenging detection task 48. These two categories represent opposite ends of the difficulty spectrum and collectively provide a comprehensive assessment of the system's performance in various abnormality detection scenarios.

Pleural effusion
Table 3 displays the outcomes of employing the AUCReshaping function with various boosting values applied at specific thresholds. The sections in the table represent results associated with different pre-trained models, while the rows represent outcomes from different experiments. In the first column, we present the boosting values applied, and in the second column, the corresponding thresholds from the ROC curve during training. For example, the configuration [0.4, 0.2, 0.1, 0.1] at [0.90, 0.92, 0.95, 0.98] signifies that a boosting value of 0.4 is applied at the 0.90 threshold, 0.2 at the 0.92 threshold, and 0.1 at both the 0.95 and 0.98 thresholds. These boosting values are employed to modify the loss values or output logits at multiple boosting thresholds. The sensitivity scores with different pre-trained models are compared, with the highlighted rows indicating the specific boosting values
Based on the experiments, it has been observed that the overall AUC score remains relatively stable, with the possibility of a slight decrease (up to 5%) that depends on the extent of correction applied. Notably, a significant drop in the AUC score can have a detrimental impact on high-specificity sensitivities. It is interesting to note that fine-tuning with the ResNet-50 layers frozen leads to lower baseline AUC scores, but can result in higher improvements (from the AUCReshaping function) when compared to the corresponding end-to-end (e2e) fine-tuning experiments. Moreover, the percentage of improvement at either discrete observation point can vary based on the specific boosting thresholds on the ROC curve.
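A configuration such as [0.4, 0.2, 0.1, 0.1] at [0.90, 0.92, 0.95, 0.98] can be read as a piece-wise constant boosting function over the predicted score. The sketch below is one interpretation and rests on assumptions: each specificity level has already been converted to a score threshold (the `thetas` values here are invented), and a misclassified positive sample receives the boost of the lowest threshold it falls below rather than a sum over all thresholds it fails. The text does not say how overlapping thresholds are combined.

```python
def piecewise_boost(prob, label, thetas, boosts):
    """Boost for one sample; thetas must be ascending and aligned with boosts."""
    if label != 1:
        return 0.0                    # only positive-class samples are boosted
    for theta, boost in zip(thetas, boosts):
        if prob < theta:
            return boost              # ASSUMPTION: first (lowest) failed threshold wins
    return 0.0                        # correctly classified at every threshold


# Hypothetical score thresholds for 0.90, 0.92, 0.95 and 0.98 specificity.
thetas = [0.60, 0.70, 0.80, 0.90]
boosts = [0.4, 0.2, 0.1, 0.1]

for p in (0.50, 0.65, 0.85, 0.95):
    print(p, piecewise_boost(p, 1, thetas, boosts))
```

Under this reading, positives that sit furthest from the highest-specificity boundary get the largest boost, which matches the description of weighting samples by their proximity to the high-specificity decision boundaries.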
The experiments aimed to identify the most effective method for boosting misclassified instances, whether through a constant value or an average of the output predictions. It was observed that using the average of the output predictions from the misclassified samples did not yield meaningful results. A more logical approach is to base the boosting criteria on the average distance of the misclassified positive samples to the high-specificity decision boundary. Implementing a weighted moving average of this distance appears to be the most appropriate metric for determining the boosting value. While a comprehensive grid search of all possible values was not conducted, preliminary experiments suggest that there is a "sweet spot" for the boosting value. Excessive boosting can lead to a drop in AUC and severe degradation in sensitivities.
In this context, Table 3 provides valuable insights, demonstrating a notable improvement of approximately 5% in sensitivity at 98% specificity when retraining all layers of the Imagenet SwAV pre-trained model. The improvements are even more substantial in the CXR_1million model, but this can be attributed to the baseline results not being close to saturation, leaving more room for improvement.

Pneumothorax
The experiments were also replicated for the ptx versus non-ptx classification, where ptx was considered the negative class, and non-ptx was regarded as the positive class. This classification problem presents a greater challenge for the model due to the smaller regions of ptx that could be missed when resizing the images to 256 × 256. This complexity is reflected in the lower baseline AUC values, with AUC at 0.93 compared to the 0.98 AUC for pleural effusion, as evident in Table 4.
Table 5. 2-disease normals versus abnormal with sensitivities at high specificities on CXR_16k when fine-tuning e2e. The % change is from the preceding row compared to the corresponding baseline data. The asterisk (*) signifies statistical significance (p < 0.05) compared to the baseline. The bolded rows are the results indicating best or highest performance. The italics is the positive % change of the previous row when compared to baseline and the bold italics is the negative % change of the previous row when compared to baseline.
The results for ptx classification presented in Table 4 demonstrate a 16% improvement in sensitivity at 98% specificity when using an Imagenet SwAV pre-trained model.Additionally, the highlighted results are statistically significant with p< 0.05.In contrast, the results obtained with the CXR_1million pre-trained model show a 5% decrease at the same point.This variation can be attributed to the choice of optimizing the 95% specificity as the operating point which indicate a statistically significant 12.69% increase with p< 0.5.
The experiments indicate that different boosting values can lead to varying levels of improvement, and determining the optimal boosting strategy falls outside the scope of this paper and is subject to system design considerations.As observed in previous experiments, the best results have been consistently achieved with the Imagenet pre-trained SwAV models.

Pleural effusion + pneumothorax
The final set of CXR experiments involved a combined dataset, where the negative class consisted of all images with either pleural effusion or pneumothorax, and the positive class contained images with any other abnormality. The results are summarized in Table 5. Notably, improvements ranging from 2 to 12% were observed across various pre-trained models when retraining all layers. Additionally, the results using MoCo 49 based Imagenet pre-trained models were compared, and the corresponding AUC scores are documented in Table 5. Of note, the MoCo-based Imagenet pre-trained model exhibited the highest AUC scores when compared to the SwAV-based Imagenet pre-trained model and the SwAV-based CXR_1million pre-trained model. Here, the AUC score exhibited a marginal decrease of 0.23% when fine-tuning end-to-end, but the sensitivity showed improvements ranging from 3 to 18% at various points on the ROC curve. The asterisk indicates statistically significant results with p < 0.05, and the MoCo-based models demonstrate the largest improvement. The behavior of the AUC value aligns with expectations, given that the modified algorithm does not explicitly optimize the overall AUC in the loss function.

Mammogram classification
The results presented in Table 6 pertain to two distinct classification tasks: breast density and BI-RADS category classification, both employing a SwAV-based Imagenet pre-trained model. For breast density classification, the baseline AUC score is notably high, registering at 0.97. Consequently, the sensitivity at 95% and 98% specificity is already above 0.8, resulting in only a marginal increase in sensitivity at both thresholds.
In contrast, the AUC score for the BI-RADS classification task is considerably lower, resting at 0.77. This leads to lower sensitivities at high-specificity levels, namely 0.18 and 0.08. Significant improvements are observed in this scenario, with a 19.64% increase in sensitivity at 95% specificity, and an impressive 42.33% increase at 98% specificity. These improvements elevate sensitivity from 0.18 to 0.2 at 95% specificity and from 0.058 to 0.09 at 98% specificity. Again, the asterisk indicates the statistical significance of improvement over the baseline with p < 0.05.

Credit card fraud classification
Similarly, the results in Table 7 underscore the effectiveness of AUCReshaping in the context of credit card fraud detection. Here, the baseline AUC score of 0.98 is considered excellent and approaches the current gold standard. Despite this, it still translates to a notable number of false positive misclassifications (normal transactions falsely labeled as fraudulent) when striving to maximize fraudulent classifications, i.e., TNR. We examine the sensitivities at much higher thresholds than before, considering nearly 0 fraudulent transactions are allowed to be misclassified. At the 99% and 99.99% specificity thresholds, the baseline sensitivities are measured at 0.73 and 0.33, respectively. However, after applying AUCReshaping, significant improvements are observed, with a 3.05% increase in sensitivity at 99% specificity and an encouraging 27.4% increase at 99.99% specificity. This translates to almost 10,000 fewer misclassifications of normal credit card transactions as fraudulent transactions at the 99.99% specificity threshold. These results underscore the potential of AUCReshaping in reducing false positives and improving the sensitivity of fraud detection systems. In Table 7 the asterisk denotes a statistically significant improvement over the baseline with p < 0.1.

Extension to multi-class and multi-label
We extend the application of our metric to the multi-class scenario, wherein each sample is associated with a single class, and the sum of predicted probabilities always equals 1. Our investigation focuses on the effect of AUCReshaping within the context of the BIRADS classification task. Specifically, we reshape the ROC curve for the BIRADS-5 category, representing the most severe malignant class. Notably, we observe a substantial enhancement of 35% in sensitivity at a specificity of 98%. It is important to note that our study primarily addresses AUC as a one-versus-all metric for BIRADS-5, and the macro average AUC and accuracy may be negatively impacted. To refine our method, we introduce modifications aimed at augmenting the weights of misclassified samples and diminishing the weights of the highest probability within the remaining four categories, ensuring that the total probabilities sum to 1. The results, as denoted in Table 8, show a 0.66% improvement in the AUC score for BIRADS-5 as well as a 36.42% improvement in the sensitivity at 98% specificity. The results demonstrate the potential of AUCReshaping in a multi-class scenario as well. For future work, we propose to showcase these enhancements and anticipate conducting a more extensive and meticulous investigation encompassing the combination of multiple classes (such as reducing the misclassification rate of both BIRADS-4 and BIRADS-5 together) and the assessment of overall classification accuracy, macro average AUC and F1 scores.
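The multi-class modification described above (raising the critical class while lowering the largest of the remaining probabilities so that the vector still sums to 1) can be sketched as follows. This is a minimal illustration assuming NumPy and softmax outputs; the function name `reshape_multiclass`, its parameters, and the clipping rule are hypothetical, and how the adjusted vector enters the loss is not specified here.

```python
import numpy as np

def reshape_multiclass(probs, labels, target, threshold, boost):
    """Move up to `boost` probability mass onto `target` for misclassified target samples.

    probs:     (n_samples, n_classes) array whose rows sum to 1
    labels:    (n_samples,) integer ground-truth classes
    target:    index of the critical class (e.g. the BIRADS-5 column)
    threshold: high-specificity decision threshold for the target class
    """
    probs = probs.copy()
    others = np.delete(np.arange(probs.shape[1]), target)
    # Samples of the critical class that fall below the high-specificity threshold.
    rows = np.where((labels == target) & (probs[:, target] < threshold))[0]
    for i in rows:
        donor = others[np.argmax(probs[i, others])]  # highest competing class
        delta = min(boost, probs[i, donor])          # never push the donor below 0
        probs[i, target] += delta
        probs[i, donor] -= delta
    return probs

# Toy example with three classes; class 2 plays the role of the critical class.
probs = np.array([[0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
labels = np.array([2, 2])
adjusted = reshape_multiclass(probs, labels, target=2, threshold=0.5, boost=0.2)
```

Taking the mass from a single donor class keeps every row a valid probability vector without renormalizing the other categories.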
In the multi-label scenario, samples may be associated with multiple labels, exemplified by cases involving the presence of both pleural effusion and pneumothorax within the same image. In such instances, treating the ROC curve of each label independently aligns with the binary classification methodology employed in our study.

Discussion and conclusion
The evaluation of deep learning systems on heavily imbalanced datasets often requires practical metrics beyond the traditional AUC score to gain a more comprehensive understanding of system performance. In response to this need, we introduce a novel metric known as "sensitivity at high-specificity". This metric is designed to assess the performance of deployed systems and enables the optimization of system performance at specific operating points along the ROC curve. To further enhance system performance, we present a new algorithm called AUCReshaping. This algorithm is applied iteratively during the fine-tuning phase when adapting pre-trained models to domain-specific classification tasks. AUCReshaping aims to improve sensitivity at high-specificity thresholds, addressing the challenges posed by imbalanced datasets and providing a more practical and effective approach to evaluating and optimizing deep learning systems.
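As an illustration of the "sensitivity at high-specificity" metric, the sketch below computes it directly from held-out scores. It assumes NumPy and binary labels where 1 marks the class whose sensitivity is measured; the helper name `sensitivity_at_specificity` is hypothetical, and the threshold is chosen conservatively so that the achieved specificity is at least the requested one.

```python
import numpy as np

def sensitivity_at_specificity(scores, labels, specificity):
    """Return (sensitivity, threshold) at a requested specificity level."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg = np.sort(scores[labels == 0])
    pos = scores[labels == 1]
    # Number of negatives allowed above the threshold (false positives).
    # The small epsilon guards against floating-point round-off in (1 - specificity).
    n_fp = int(np.floor((1.0 - specificity) * neg.size + 1e-9))
    if n_fp >= neg.size:
        thr = -np.inf  # every negative may be a false positive
    else:
        thr = neg[neg.size - n_fp - 1]  # the (n_fp + 1)-th largest negative score
    # A sample is called positive when its score is strictly above the threshold.
    return float(np.mean(pos > thr)), float(thr)

# Toy data: ten negatives and four positives.
neg_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pos_scores = [0.85, 0.95, 0.5, 0.7]
scores = np.array(neg_scores + pos_scores)
labels = np.array([0] * 10 + [1] * 4)
sens, thr = sensitivity_at_specificity(scores, labels, specificity=0.9)
```

Reporting this value at the intended operating point (for example 95%, 98%, or 99.99% specificity) alongside the AUC makes visible the differences between models that the AUC alone can hide.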
In this paper, we explore the effectiveness of AUCReshaping on three distinct datasets: Chest X-Rays (CXRs), breast mammograms, and credit card fraud detection. Our CXR experiments encompass three classification tasks: pleural effusion versus non-pleural effusion, pneumothorax (ptx) versus non-ptx, and a combined classification for images containing either pleural effusion or ptx as a negative class. To perform these tasks, we leverage a range of pre-trained models, including Imagenet-based models such as SwAV and MoCo, as well as an internally pre-trained model, referred to as CXR_1million. The general workflow includes a pre-training and a fine-tuning stage, where fine-tuning is carried out through two methods: end-to-end (e2e) training or with frozen ResNet-50 weights, with a focus on end-to-end retraining, which results in higher overall performance. We assess classifier performance by comparing the sensitivity at 95% and 98% specificity on test datasets, utilizing various models for evaluation.
Our breast mammogram experiments address two classification tasks: breast density (A,B vs. C,D) and BI-RADS category (1,2 vs. 3,4,5), both utilizing the Imagenet pre-trained model based on SwAV. In the former case, density categories A and B serve as the positive class, while in the latter, BI-RADS categories 1 and 2 are considered the positive class. Fine-tuning is executed end-to-end as before, to evaluate metric improvements at 95% and 98% specificity. Lastly, our credit card fraud detection involves a straightforward Multi-Layer Perceptron (MLP) network. Sensitivities are evaluated at much higher specificities of 99% and 99.99%.
The implementation of AUCReshaping has shown notable improvements in sensitivity ranging from 2 to 40% across various test datasets. The extent of enhancement is contingent on factors such as the level of reshaping or boosting applied in the high-specificity region of the ROC curve and the baseline AUC scores. Importantly, these improvements exhibit a high degree of generalizability as they are demonstrated in the context of various diseases, domains, and with different pre-trained models. To further advance the state-of-the-art (SOTA) in terms of AUC scores and related metrics, it is possible to explore alternative architectures such as DenseNet or Vision-Transformer. By doing so, we can aim to achieve even better results and push the boundaries of performance in these classification tasks. It is worth noting that while boosting the output predictions can lead to substantial improvements, excessively high boosting values may perturb the output predictions to a degree that harms the overall network performance. As a result, identifying the optimal boosting value becomes a critical design consideration to maximize the performance benefits.
The AUCReshaping method offers a high degree of flexibility, as it can be applied at varying levels of granularity, depending on the specific requirements of the application. The selection of optimal thresholds and boosting values is highly dependent on the characteristics of the dataset and the particular design choices made. Indeed, the choice between retraining all layers during fine-tuning and fine-tuning only the linear classifier has a significant impact on the AUC score improvement. Retraining all layers often results in higher AUC scores, but as the sensitivity at high-specificity for the baseline is lower in the latter case, a greater improvement is observed in the high-specificity sensitivities.
The choice of boosting strategy is also a crucial consideration. Whether to use a constant boosting value for all thresholds or weighted boosting for each threshold can lead to different levels of improvement or changes in performance. It is essential to strike a balance and avoid excessive boosting values, especially at thresholds far removed from the desired operating point, as this can lead to unexpected drops in performance. For example, if the improvement is desired at 98 or 99% specificity, applying the reshaping algorithm in the 60-80% range will most definitely cause the performance at 98% to drop. The distribution of the feature space within the manifold varies between different contrastive learning algorithms, resulting in similar AUC scores but distinct sensitivities at high specificities. The divergence in feature learning contributes to differences in output probabilities and decision boundaries among various models.
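To make the contrast between constant and per-threshold weighted boosting concrete, the sketch below assigns each threshold a boost that decays with its distance from the desired operating point. The exponential decay, the `scale` parameter, and the function name are assumptions chosen purely for illustration; the paper does not prescribe a specific weighting scheme.

```python
import numpy as np

def per_threshold_boosts(specificities, target, base_boost, scale=0.02):
    """Hypothetical weighting: full boost at the target specificity,
    exponentially smaller boosts at thresholds farther away from it."""
    specs = np.asarray(specificities, dtype=float)
    return base_boost * np.exp(-np.abs(specs - target) / scale)

specs = [0.98, 0.95, 0.70]
constant = np.full(len(specs), 0.1)  # same boost at every threshold
weighted = per_threshold_boosts(specs, target=0.98, base_boost=0.1)
```

With this weighting, the threshold at 70% specificity receives a negligible boost, in line with the observation that reshaping far from the operating point can degrade performance at 98%.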
This research represents a significant step toward the ultimate goal of developing practical deep learning systems capable of optimizing binary classification tasks on heavily imbalanced datasets with significant misclassification costs. Ongoing research efforts may extend these ideas to combinations of multi-class classification tasks while improving the global accuracy, and explore approaches to address high-specificity misclassifications in the pre-training stage. The hope is that future research will continue to advance point-wise performance on ROC curves while also introducing practical metrics to assess the performance of such systems alongside AUC scores.

Data availability
The study uses both publicly available data and data acquired for the project through collaboration and procurement agreements. The public data can be obtained from online resources and used under the applicable license guidance. Proper references are provided in the manuscript for the public data resources. The internal data acquired for the project can be made available by the corresponding author upon reasonable request.
The Breast Mammogram dataset represents a substantial publicly available resource consisting of full-field digital mammography data 33,34 , which encompasses both BI-RADS assessments and annotations pertaining to abnormalities. This dataset comprises a total of 20,000 DICOM images, which correspond to 5000 distinct mammography examinations conducted between the years 2018 and 2020. The selection of these images was performed through random sampling, with a strict de-identification process being applied. Patient metadata, including details such as manufacturers and manufacturers' model names, is included in the dataset, but notably, it excludes individuals aged 89 years or above. To ensure complete de-identification, patient-specific information present within the images is removed through predefined rectangular cropping. Furthermore, the de-identification process involves a meticulous review by two independent evaluators to ascertain the thorough removal of any patient-related data. The Breast mammogram dataset can be requested at https://github.com/vinbigdata-medical/vindr-mammo.
The credit card dataset has been collected and analyzed through a research collaboration of Worldline and the Machine Learning Group 50 of the Université Libre de Bruxelles (ULB) on big data mining and fraud detection. The credit card dataset can be downloaded from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download.

Figure 2. The schematic provides an overview of the experimental methodology, highlighting the AUC reshaping function's role in reshaping a specific portion of the ROC curve, included in the "region of interest." The ROC figures depicted in the diagram illustrate the adjustments made in the high-specificity region, with a slight potential decrease in the overall AUC value, resulting from modifications in the remaining parts of the curve.

Figure 3. Illustration of the process of the AUCReshaping() function during fine-tuning. In each iteration, the function is applied to increase the weights of the misclassified samples at the high-specificity threshold. This process is repeated at multiple thresholds with different weighting values. (a) shows the optimal threshold that separates positive (blue) samples from negative (red) samples. (b) demonstrates a high-specificity threshold that aims to reduce the misclassifications of negative (red) samples. (c) represents the re-weighting of high-specificity misclassified positive samples (blue), which increases the uncertainty in the model's predictions.

Figure 4. The original ROC curve of an SSL system (shown in orange) is transformed by AUCReshaping, resulting in the new ROC curve (depicted in red). While the final AUC score may experience a marginal reduction, it is evident that this transformation leads to higher sensitivity at high-specificity levels.

Figure 5. Example Chest X-Rays, sourced from the internal dataset CXR_16k, display pleural effusion abnormalities identified by bounding boxes.

Figure 6. Breast mammogram images featuring bounding boxes highlighting calcification and a mass abnormality, as sourced from the VinDr-Mammo dataset.

https://doi.org/10.1038/s41598-023-48482-x

and thresholds that lead to the highest improvements. The asterisk indicates results with statistically significant improvements at p < 0.05. Colored values represent the percentage change when comparing a given row to its corresponding baseline. The skewness and kurtosis for all the measured sensitivity values are reported in brackets.

Table 2. List of datasets used for classification studies in this paper.

Table 3. Pleural effusion versus Non-pleural effusion sensitivity at high-specificity values on CXR_16k before and after reshaping the ROC curve when all layers are retrained (e2e fine-tuning). The % change is from the preceding row compared to the corresponding baseline data. The asterisk (*) signifies statistical significance (p < 0.05) compared to the baseline. The bolded rows are the results indicating best or highest performance. The italics is the positive % change of the previous row when compared to baseline.

Table 4. Pneumothorax versus Non-pneumothorax sensitivity at high-specificity values on CXR_16k before and after reshaping the ROC curve, fine-tuned e2e. The % change is from the preceding row compared to the corresponding baseline data. An asterisk (*) signifies a statistically significant improvement over the baseline (p < 0.05). The bolded rows are the results indicating best or highest performance. The italics is the positive % change of the previous row when compared to baseline and the bold italics is the negative % change of the previous row when compared to baseline.

Table 7. Credit card fraud classification of normal versus fraudulent transactions. The asterisk (*) signifies statistical significance (p < 0.10) compared to the baseline. The bolded rows are the results indicating best or highest performance. The italics is the positive % change of the previous row when compared to baseline.

Table 8. Multi-class classification results of BIRADS-5 category in breast mammogram classification. The ROC curve of only category 5 as the critical class is reshaped. The asterisk (*) signifies statistical significance (p < 0.05) compared to the baseline. The bolded rows are the results indicating best or highest performance. The italics is the positive % change of the previous row when compared to baseline and the bold italics is the negative % change of the previous row when compared to baseline.

Scientific Reports | (2023) 13:21097